source("utils.R")
theme_set(theme_minimal())
November 29, 2024
In this vignette we will show:
Multivariate lattice data analysis methods for imaging-based approaches.
This includes global metrics on the entire field of view and local variants thereof.
The use case is a CosMx data set from He et al. (2022).
Complementary resources using this data and methods are found in the Voyager CosMx vignette, Voyager bivariate vignette and Voyager multivariate vignette.
To represent the cells, we rely on the SpatialFeatureExperiment package. For preprocessing of the dataset, we refer the reader to the vignette of the Voyager package.
class: SpatialFeatureExperiment
dim: 980 100290
metadata(0):
assays(1): counts
rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
rowData names(3): means vars cv2
colnames(100290): 1_1 1_2 ... 30_4759 30_4760
colData names(17): Area AspectRatio ... nCounts nGenes
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : CenterX_global_px CenterY_global_px
imgData names(1): sample_id
unit: full_res_image_pixels
Geometries:
colGeometries: centroids (POINT), cellSeg (POLYGON)
Graphs:
sample01:
[1] 20
In this vignette we highlight lattice data analysis approaches for multivariate observations. We will show the metrics related to a ligand-receptor pair, CEACAM6 and EGFR, which was identified in the original publication of the CosMx dataset (He et al. 2022).
A lattice consists of individual spatial units \(D = \{A_1, A_2,...,A_n\}\) where the units do not overlap. The data is then a realisation of a random variable along the lattice \(Y_i = Y (A_i)\) (Zuur, Ieno, and Smith 2007). The lattice is irregular if the units have variable size and are not spaced regularly, as is the case with cells in tissue.
More details about lattices can be found here.
One of the challenges when working with (irregular) lattice data is the construction of a neighbourhood graph (Pebesma and Bivand 2023). The main question is, what to consider as neighbours, as this will affect downstream analyses. Various methods exist to define neighbours, such as contiguity-based neighbours (neighbours in direct contact), graph-based neighbours (e.g., \(k\)-nearest neighbours), distance-based neighbours or higher order neighbours (Getis 2009; Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023). The documentation of the package spdep provides an overview of the different methods (Bivand 2022).
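To make the difference between these neighbour definitions concrete, the following base R sketch builds a \(k\)-nearest-neighbour and a distance-based neighbour matrix for a handful of simulated centroids. The coordinates, `k = 3` and the threshold `d = 0.3` are illustrative assumptions and not values used in this vignette; for real data the spdep functions would be used instead.

```r
set.seed(1)
# Hypothetical centroids of 30 cells in the unit square
coords <- cbind(x = runif(30), y = runif(30))
D <- as.matrix(dist(coords))   # pairwise Euclidean distances
diag(D) <- Inf                 # a cell is not its own neighbour

# k-nearest neighbours: every cell gets exactly k neighbours
k <- 3
A_knn <- matrix(0, nrow(D), ncol(D))
for (i in seq_len(nrow(D))) {
  A_knn[i, order(D[i, ])[seq_len(k)]] <- 1
}

# Distance-based neighbours: all cells at most d apart
d <- 0.3
A_dist <- (D <= d) * 1

# Row-standardise (style "W"): weights in each non-empty row sum to 1
row_standardise <- function(A) {
  rs <- rowSums(A)
  A / ifelse(rs > 0, rs, 1)
}
W_knn <- row_standardise(A_knn)

table(rowSums(A_knn))    # number of neighbours per cell, always k
table(rowSums(A_dist))   # varies with local density and can be 0
isSymmetric(A_knn)       # kNN relations are generally not symmetric
isSymmetric(A_dist)      # distance-based relations are symmetric
```

The kNN definition guarantees that no cell is isolated, at the price of linking cells that may be far apart; the distance-based definition does the opposite.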
We consider first contiguity-based neighbours. As cell segmentation is notoriously imperfect, we add a snap value, which means that we consider all cells with distance 20 or less as contiguous (Pebesma and Bivand 2023; Wang 2019).
colGraph(sfe, "poly2nb") <-
  findSpatialNeighbors(sfe,
    type = "cellSeg",
    method = "poly2nb", # wraps the spdep function with the same name
    style = "W",
    snap = 20 # all cells with less distance apart are considered contiguous
  )
p1 <- plotColGraph(sfe,
  colGraphName = "poly2nb",
  colGeometryName = "cellSeg",
  bbox = c(xmin = 3500, xmax = 10000, ymin = 157200, ymax = 162200)
) + theme_void()
Alternatively, we can use a \(k\)-nearest neighbours approach. The choice of the number \(k\) is somewhat arbitrary.
colGraph(sfe, "knn5") <-
  findSpatialNeighbors(sfe,
    method = "knearneigh", # wraps the spdep function with the same name
    k = 5,
    zero.policy = TRUE
  )
p2 <- plotColGraph(sfe,
  colGraphName = "knn5",
  colGeometryName = "cellSeg",
  bbox = c(xmin = 3500, xmax = 10000, ymin = 157200, ymax = 162200)
) + theme_void()
The graphs below show noticeable differences. In the contiguous neighbour graph on the left (neighbours in direct contact), we can see the formation of distinct patches that are not connected to the rest of the tissue. In addition, some cells do not have any direct neighbours. In contrast, the \(k\)-nearest neighbours (kNN) graph on the right reveals that these patches tend to be connected to the rest of the structure.
Here we set the arguments for the examples below.
With a defined spatial weight matrix, one can calculate multivariate spatial metrics. We will consider both global and local bivariate observations as well as local multivariate spatial metrics.
For two continuous variables the global bivariate Moran’s \(I\) is defined as (Wartenberg 1985; Bivand 2022)
\[I_B = \frac{\sum_i\left(\sum_j{w_{ij}y_j\times x_i}\right)}{\sum_i{x_i^2}}\]
where \(x_i\) and \(y_i\) are the two variables of interest and \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\).
The global bivariate Moran’s \(I\) is a measure of correlation between the variables \(x\) and \(y\) where \(y\) has a spatial lag. The result might overestimate the spatial autocorrelation of the variables due to the non-spatial correlation of \(x\) and \(y\) (Bivand 2022).
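As a minimal sketch, the formula can be evaluated directly in base R. The helper `bivariate_moran()`, the four-unit ring lattice and the toy values are assumptions made for illustration; the variables are standardised first, which the formula implicitly assumes.

```r
# Bivariate Moran's I as in the formula above:
# I_B = sum_i( x_i * sum_j w_ij y_j ) / sum_i x_i^2
bivariate_moran <- function(x, y, W) {
  stopifnot(length(x) == length(y), all(dim(W) == length(x)))
  lag_y <- as.vector(W %*% y)   # spatial lag of y: sum_j w_ij y_j
  sum(x * lag_y) / sum(x^2)
}

# Toy lattice: four units on a ring, each with two neighbours
n <- 4
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  nb <- c(if (i == 1) n else i - 1, if (i == n) 1 else i + 1)
  W[i, nb] <- 1 / length(nb)    # row-standardised weights (style "W")
}

# Two perfectly correlated variables that alternate along the ring
x <- c(1, -1, 1, -1)
y <- c(2, -2, 2, -2)
z <- function(v) as.vector(scale(v))   # centre and scale

bivariate_moran(z(x), z(y), W)
```

Here every unit is surrounded by neighbours carrying the opposite value, so the statistic is strongly negative even though `x` and `y` are perfectly correlated in the non-spatial sense.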
spdep
DATA PERMUTATION
Call:
boot(data = xx, statistic = bvm_boot, R = nsim, sim = "permutation",
listw = listw, parallel = parallel, ncpus = ncpus, cl = cl)
Bootstrap Statistics :
original bias std. error
t1* -0.1577808 0.1578294 0.002792404
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 499 bootstrap replicates
CALL :
boot::boot.ci(boot.out = res, conf = c(0.99, 0.95, 0.9), type = "basic")
Intervals :
Level Basic
99% (-0.3240, -0.3083 )
95% (-0.3213, -0.3101 )
90% (-0.3201, -0.3110 )
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
The value t0 (the column original in the output above) is the test statistic of global bivariate Moran’s \(I\). The global bivariate Moran’s \(I\) value for the genes KRT17 and TAGLN is -0.1577808. Significance can be assessed by comparing the permuted confidence interval with the test statistic.
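The idea behind such a permutation assessment can be sketched in base R. This is not the spdep implementation: the line-shaped lattice, the simulated variables and the two-sided pseudo p-value are assumptions chosen for illustration.

```r
bivariate_moran <- function(x, y, W) {
  sum(x * as.vector(W %*% y)) / sum(x^2)
}

set.seed(42)
n <- 50
# Hypothetical lattice: units on a line, adjacent units are neighbours
A <- matrix(0, n, n)
A[abs(row(A) - col(A)) == 1] <- 1
W <- A / rowSums(A)                     # row-standardised weights

# Simulated variables: x varies smoothly in space, y is a noisy copy of x
x <- as.vector(scale(sin(seq_len(n) / 5) + rnorm(n, sd = 0.3)))
y <- as.vector(scale(x + rnorm(n, sd = 0.5)))

observed <- bivariate_moran(x, y, W)

# Reference distribution: shuffle x over the lattice and recompute
nsim <- 499
permuted <- replicate(nsim, bivariate_moran(sample(x), y, W))

# Two-sided pseudo p-value; the +1 counts the observed value itself
p_value <- (sum(abs(permuted) >= abs(observed)) + 1) / (nsim + 1)

quantile(permuted, c(0.025, 0.975))     # range expected without spatial structure
observed
p_value
```

An observed value far outside the range of the permuted values indicates that the spatial arrangement of the two variables is unlikely to have arisen by chance.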
Lee’s \(L\) is a bivariate measure that combines non-spatial Pearson correlation with spatial autocorrelation via Moran’s \(I\) (Lee 2001). This enables us to assess the spatial dependence of two continuous variables in a single measure. The measure is defined as
\[L(x,y) = \frac{n}{\sum_{i=1}^n\left(\sum_{j=1}^n w_{ij}\right)^2}\,\frac{\sum_{i=1}^n\left[\left(\sum_{j=1}^n w_{ij}(x_j-\bar{x})\right)\left(\sum_{j=1}^n w_{ij}(y_j-\bar{y})\right)\right]}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\]
where \(w_{ij}\) is the value of the spatial weights matrix for positions \(i\) and \(j\), \(x\) and \(y\) are the two variables of interest, and \(\bar{x}\) and \(\bar{y}\) are their means (Lee 2001; Bivand 2022).
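The definition translates almost term by term into base R. The function `lee_L()` and the toy lattice below are a sketch under stated assumptions (units on a line, row-standardised weights), not the implementation used by spdep or Voyager.

```r
# Lee's L following the formula above
lee_L <- function(x, y, W) {
  n <- length(x)
  xc <- x - mean(x)
  yc <- y - mean(y)
  lag_x <- as.vector(W %*% xc)   # sum_j w_ij (x_j - mean(x))
  lag_y <- as.vector(W %*% yc)   # sum_j w_ij (y_j - mean(y))
  (n / sum(rowSums(W)^2)) *
    sum(lag_x * lag_y) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
}

# Hypothetical lattice: 20 units on a line, adjacent units are neighbours
n <- 20
A <- matrix(0, n, n)
A[abs(row(A) - col(A)) == 1] <- 1
W <- A / rowSums(A)

x <- seq_len(n)          # smooth gradient along the lattice
y <- rev(x)              # the opposite gradient

lee_L(x, x, W)           # same spatial pattern
lee_L(x, y, W)           # opposite spatial pattern
cor(x, y)                # non-spatial Pearson correlation for comparison
```

Because both variables are replaced by their spatial lags, the measure is symmetric in `x` and `y`, which is not the case for the bivariate Moran’s \(I\).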
voyager
statistic
-0.1527921
[1] 0.998
The effect size of bivariate Lee’s \(L\) for the genes KRT17 and TAGLN is -0.1527921 and the associated p-value is 0.998.
Similar to the global bivariate Moran’s \(I\) statistic, there is a local analogue. The formula is given by:
\[ I_i^B = x_i\sum_jw_{ij}y_j \]
This can be interesting in the context of detection of coexpressed ligand-receptor pairs. A method that is based on local bivariate Moran’s \(I\) and tries to detect such pairs is SpatialDM (Li et al. 2023).
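A small base R sketch shows how the local values relate to the global statistic. The helper `local_moran_bv()`, the line-shaped lattice and the simulated values are illustrative assumptions.

```r
# Local bivariate Moran's I: I_i^B = x_i * sum_j w_ij y_j
local_moran_bv <- function(x, y, W) x * as.vector(W %*% y)

set.seed(5)
n <- 10
A <- matrix(0, n, n)
A[abs(row(A) - col(A)) == 1] <- 1   # units on a line, adjacent units are neighbours
W <- A / rowSums(A)                 # row-standardised weights

x <- as.vector(scale(rnorm(n)))     # simulated, standardised variables
y <- as.vector(scale(rnorm(n)))

I_local <- local_moran_bv(x, y, W)  # one value per spatial unit
I_global <- sum(I_local) / sum(x^2) # summing the local values recovers I_B

round(I_local, 3)
I_global
```

Units with a large positive local value have a high (or low) value of `x` surrounded by neighbours with high (or low) values of `y`.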
voyager
sfe_tissue <- runBivariate(sfe, type = "localmoran_bv",
                           feature1 = features[1], feature2 = features[2],
                           colGraphName = colGraphName,
                           nsim = 499)
plotLocalResult(sfe_tissue, "localmoran_bv",
                features = localResultFeatures(sfe_tissue, "localmoran_bv"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = 2)
Similar to the global variant of Lee’s \(L\), the local variant (Lee 2001; Bivand 2022) is defined as
\[L_i(x,y) = \frac{n\left(\sum_{j=1}^n w_{ij}(x_j-\bar{x})\right)\left(\sum_{j=1}^n w_{ij}(y_j-\bar{y})\right)}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\]
Local Lee’s \(L\) is a measure of spatial co-expression when the variables of interest are gene expression measurements. Unlike the global version, the values are not aggregated over all locations and show the local contribution of each location to the metric. Positive values indicate colocalization, negative values indicate segregation (Lee 2001; Bivand 2022).
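The link between the local and the global statistic follows from comparing the two definitions. Summing the local values over all locations and dividing by the normalising term of the global formula gives

\[L(x,y) = \frac{1}{\sum_{i=1}^n\left(\sum_{j=1}^n w_{ij}\right)^2}\sum_{i=1}^n L_i(x,y)\]

Under the additional assumption of row-standardised weights where every location has at least one neighbour, \(\sum_{j=1}^n w_{ij} = 1\) for all \(i\), so the normalising term equals \(n\) and the global value is the mean of the local values:

\[L(x,y) = \frac{1}{n}\sum_{i=1}^n L_i(x,y)\]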
voyager
sfe_tissue <- runBivariate(sfe, type = "locallee",
                           feature1 = features[1], feature2 = features[2],
                           colGraphName = colGraphName)
plotLocalResult(sfe_tissue, "locallee",
                features = localResultFeatures(sfe_tissue, "locallee"),
                ncol = 2, divergent = TRUE, diverge_center = 0,
                colGeometryName = colGeometryName, size = 2)
Geary’s \(C\) is a measure of spatial autocorrelation that is based on the difference between the value of a variable at a location and the values at its neighbours. Anselin (1995, 2019) defines the local version as
\[c_i = \sum_{j=1}^n w_{ij}(x_i-x_j)^2\]
which can be generalized to \(k\) features (in our case genes) by summing over the variables
\[c_{k,i} = \sum_{v=1}^k c_{v,i}\]
where \(c_{v,i}\) is the local Geary’s \(C\) for the \(v\)th variable at location \(i\). The number of variables that can be used is not fixed, which makes the interpretation a bit more difficult. In general, the metric summarizes similarity in the “multivariate attribute space” (i.e. the gene expression) to its geographic neighbours. The common difficulty in these analyses is the interpretation of the mixture of similarity in the physical space and similarity in the attribute space (Anselin 2019, 1995).
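The two formulas can be sketched in base R as follows. The helper names, the simulated data and the standardisation of each variable before summing are assumptions made for this illustration; they are not taken from the Voyager or spdep implementation.

```r
# Univariate local Geary's C: c_i = sum_j w_ij (x_i - x_j)^2
local_geary <- function(x, W) {
  vapply(seq_along(x), function(i) sum(W[i, ] * (x[i] - x)^2), numeric(1))
}

# Multivariate version: sum of the univariate statistics over all variables
local_geary_multi <- function(X, W) {
  Z <- scale(X)                              # put all variables on one scale
  rowSums(apply(Z, 2, local_geary, W = W))   # sum over variables per location
}

set.seed(9)
n <- 30
A <- matrix(0, n, n)
A[abs(row(A) - col(A)) == 1] <- 1            # units on a line
W <- A / rowSums(A)                          # row-standardised weights

# Three hypothetical "genes": two smooth in space, one pure noise
pos <- seq_len(n)
X <- cbind(g1 = sin(pos / 4), g2 = cos(pos / 6), g3 = rnorm(n))

c_multi <- local_geary_multi(X, W)
head(round(c_multi, 3))
```

Small values mark locations whose neighbours are similar across all variables, large values mark locations that differ from their neighbours in at least some of them.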
voyager
To speed up computation we will use highly variable genes.
hvgs <- getTopHVGs(sfe, fdr.threshold = 0.01)
# Subset of the tissue
sfe_tissue <- runMultivariate(sfe, type = "localC_multi",
                              subset_row = hvgs,
                              colGraphName = colGraphName)
# Local C multi is stored in colData so this is a workaround to plot it
plotSpatialFeature(sfe_tissue, "localC_multi", size = 2, scattermore = FALSE)
We can further plot the results of the permutation test. Significant values indicate interesting regions, but should be interpreted with care for various reasons. For example, we are looking for similarity in a combination of multiple features but the exact combination is not known. Anselin (2019) writes: “Overall, however, the statistic indicates a combination of the notion of distance in multi-attribute space with that of geographic neighbors. This is the essence of any spatial autocorrelation statistic. It is also the trade-off encountered in spatially constrained multivariate clustering methods (for a recent discussion, see, e.g., Grubesic, Wei, and Murray 2014).” Multi-attribute space refers here to the highly variable genes. The question is where the similarity comes from: the gene expression or the physical space (Anselin 2019). The same problem is common in spatial domain detection methods.
Plotted are the effect size and the adjusted p-values in space.
This test is useful to assess the overlap of the \(k\)-nearest neighbours from physical distances (tissue space) with the \(k\)-nearest neighbours from the gene expression measurements (attribute space). \(k\)-nearest neighbour matrices are computed for both physical and attribute space. In a second step the probability of overlap between the two matrices is computed (Anselin and Li 2020).
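The two steps can be sketched in base R on simulated data. The helper `knn_index()`, the toy coordinates and features, and the use of a hypergeometric probability for the overlap are assumptions made for illustration; the rgeoda implementation may differ in detail.

```r
# Indices of the k nearest neighbours of every row of M (Euclidean distance)
knn_index <- function(M, k) {
  D <- as.matrix(dist(M))
  diag(D) <- Inf                         # exclude the observation itself
  t(apply(D, 1, function(d) order(d)[seq_len(k)]))
}

set.seed(7)
n <- 100
k <- 5
coords <- cbind(runif(n), runif(n))      # hypothetical cell positions
expr <- cbind(
  coords[, 1] + rnorm(n, sd = 0.1),      # feature that follows the x position
  rnorm(n)                               # feature without spatial structure
)

nn_geo <- knn_index(coords, k)           # neighbours in physical space
nn_att <- knn_index(scale(expr), k)      # neighbours in attribute space

# Step 1: number of neighbours shared between the two sets
cardinality <- vapply(
  seq_len(n),
  function(i) length(intersect(nn_geo[i, ], nn_att[i, ])),
  integer(1)
)

# Step 2: probability of exactly this overlap when k neighbours are drawn
# at random from the other n - 1 observations (hypergeometric)
probability <- dhyper(cardinality, k, (n - 1) - k, k)

table(cardinality)
```

A high cardinality with a low probability flags observations whose nearest neighbours in tissue space are also their nearest neighbours in expression space.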
sf <- colGeometries(sfe)[[segmentation]]
sf <- cbind(sf, t(as.matrix(logcounts(sfe)[hvgs,])))
nbr_test <- neighbor_match_test(sf[c(hvgs)], k = 20)
sf$Probability <- nbr_test$Probability
sf$Cardinality <- nbr_test$Cardinality
p <- tm_shape(sf) + tm_fill(col = 'Cardinality')
q <- tm_shape(sf) + tm_fill(col = 'Probability')
tmap_arrange(p,q)
Cardinality is the number of neighbours that the two matrices have in common. Some regions show high cardinality with low probability and therefore share similarity in both attribute and physical space. In contrast to multivariate local Geary’s \(C\), this metric focuses directly on the distances and not on a weighted average. A limitation of this approach is the empty space problem: as the number of dimensions of the feature set increases, the empty space between observations also increases (Anselin and Li 2020).
In addition to measures of spatial autocorrelation of continuous data as seen above, the join count statistic applies the same concept to binary and categorical data. In essence, the join count statistic compares the distribution of categorical marks in a lattice with frequencies that would occur randomly. These random occurrences can be computed using a theoretical approximation or random permutations. The same concept was also extended to a multivariate setting with more than two categories. The corresponding spdep functions are called joincount.test and joincount.multi (Dale and Fortin 2014; Bivand 2022; Cliff and Ord 1981).
spdep
First, we need to get categorical marks for each data point. We do so by running (non-spatial) PCA on the data.
We can visualise the clusters as follows:
The join count statistic is calculated as follows:
Joincount Expected Variance z-value
1:1 2840.500 1444.261 133.249 120.9560
2:2 482.200 135.630 20.661 76.2449
3:3 957.000 415.302 54.326 73.4943
4:4 445.000 112.541 17.444 79.6004
5:5 181.500 19.404 3.326 88.8816
6:6 3376.400 997.107 105.254 231.9149
2:1 654.900 885.392 109.162 -22.0608
3:1 1410.000 1549.192 182.879 -10.2927
3:2 482.000 474.806 68.211 0.8711
4:1 825.800 806.531 99.941 1.9274
4:2 185.400 247.191 38.303 -9.9841
4:3 540.400 432.515 62.572 13.6386
5:1 273.800 334.996 42.759 -9.3585
5:2 91.600 102.671 16.645 -2.7137
5:3 143.500 179.647 27.072 -6.9472
5:4 41.900 93.527 15.290 -13.2031
6:1 64.800 2400.367 267.219 -142.8758
6:2 262.900 735.679 96.167 -48.2108
6:3 253.200 1287.236 159.995 -81.7491
6:4 17.200 670.153 88.114 -69.5602
6:5 72.500 278.351 37.869 -33.4511
Jtot 5319.900 10478.254 402.484 -257.1206
The rows show the different combinations of clusters that are in physical contact. For example, \(1:1\) means contacts of cluster \(1\) with itself. The column Joincount is the observed statistic, whereas the column Expected is the expected value of the statistic for this combination. This lets us assess whether contacts between cluster combinations occur more or less frequently than expected at random (Cliff and Ord 1981).
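How such a table arises can be sketched in base R. The helper `join_counts()`, the 4 x 4 grid with rook neighbours and the two artificial clusters are assumptions for illustration; the expected counts are approximated by random permutations rather than by the theoretical formula.

```r
# Join counts for a symmetric binary adjacency matrix A:
# entry [a, b] is the number of neighbour pairs with labels a and b
join_counts <- function(labels, A) {
  labels <- factor(labels)
  Z <- outer(as.integer(labels), seq_along(levels(labels)), "==") * 1
  M <- t(Z) %*% A %*% Z
  dimnames(M) <- list(levels(labels), levels(labels))
  diag(M) <- diag(M) / 2      # same-label pairs were counted twice
  M
}

# Hypothetical lattice: 4 x 4 grid, units sharing an edge are neighbours
grid <- expand.grid(x = 1:4, y = 1:4)
A <- (as.matrix(dist(grid)) == 1) * 1

# Two spatially separated clusters: left half "a", right half "b"
labels <- ifelse(grid$x <= 2, "a", "b")
observed <- join_counts(labels, A)

# Expected counts under random labelling, approximated by permutation
set.seed(3)
nsim <- 999
perm <- replicate(nsim, join_counts(sample(labels), A))
expected <- apply(perm, c(1, 2), mean)

observed
round(expected, 2)
```

In this toy example the same-cluster joins exceed their permutation expectation and the between-cluster joins fall below it, which is the pattern seen for most cluster pairs in the table above.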
The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, thus the underlying process cannot be inferred from characterising the pattern. An indication of clustering does not explain why it occurs. On the one hand, clustering can be the result of spatial interaction between the variables of interest. This is the case if a gene of interest is highly expressed in a tissue region. On the other hand, clustering can be the result of spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., when cells with uniform expression of a gene of interest are grouped together, which then creates the apparent clustering of the gene expression measurement.
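The effect of multiple testing can be illustrated with simulated p-values and the base R function `p.adjust()`. The number of cells, the share of cells with a true local signal and the choice of adjustment methods are assumptions for this sketch, and the simulated p-values ignore the spatial dependence between neighbouring tests.

```r
set.seed(11)
# Hypothetical local p-values: 950 cells without signal, 50 cells with signal
p_raw <- c(runif(950), rbeta(50, 1, 200))

p_bh <- p.adjust(p_raw, method = "BH")            # controls the false discovery rate
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the family-wise error rate

# Number of cells called significant at the 0.05 level
c(
  raw = sum(p_raw < 0.05),
  BH = sum(p_bh < 0.05),
  bonferroni = sum(p_bonf < 0.05)
)
```

Without adjustment, roughly 5% of the cells without any signal are expected to fall below 0.05 by chance alone, which is why adjusted p-values are preferable for local statistics.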
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] dixon_0.0-8 splancs_2.01-44
[3] sp_2.1-1 bluster_1.10.0
[5] magrittr_2.0.3 stringr_1.5.1
[7] spdep_1.2-8 spData_2.3.0
[9] tmap_3.3-4 scater_1.28.0
[11] scran_1.28.2 scuttle_1.10.3
[13] SFEData_1.2.0 SpatialFeatureExperiment_1.2.3
[15] Voyager_1.2.7 rgeoda_0.0.10-4
[17] digest_0.6.33 sf_1.0-19
[19] reshape2_1.4.4 patchwork_1.3.0
[21] STexampleData_1.8.0 ExperimentHub_2.8.1
[23] AnnotationHub_3.8.0 BiocFileCache_2.8.0
[25] dbplyr_2.3.4 rlang_1.1.4
[27] ggplot2_3.5.1 dplyr_1.1.4
[29] spatstat_3.0-6 spatstat.linnet_3.1-1
[31] spatstat.model_3.2-6 rpart_4.1.19
[33] spatstat.explore_3.3-3 nlme_3.1-162
[35] spatstat.random_3.3-2 spatstat.geom_3.3-4
[37] spatstat.univar_3.1-1 spatstat.data_3.1-4
[39] SpatialExperiment_1.10.0 SingleCellExperiment_1.22.0
[41] SummarizedExperiment_1.30.2 Biobase_2.60.0
[43] GenomicRanges_1.52.1 GenomeInfoDb_1.36.4
[45] IRanges_2.34.1 S4Vectors_0.38.2
[47] BiocGenerics_0.46.0 MatrixGenerics_1.12.3
[49] matrixStats_1.4.1
loaded via a namespace (and not attached):
[1] splines_4.3.1 later_1.3.1
[3] bitops_1.0-9 filelock_1.0.3
[5] tibble_3.2.1 R.oo_1.27.0
[7] polyclip_1.10-7 XML_3.99-0.14
[9] lifecycle_1.0.4 edgeR_3.42.4
[11] lattice_0.21-8 crosstalk_1.2.0
[13] limma_3.56.2 rmarkdown_2.25
[15] yaml_2.3.7 metapod_1.7.0
[17] httpuv_1.6.11 spatstat.sparse_3.1-0
[19] RColorBrewer_1.1-3 DBI_1.2.3
[21] abind_1.4-8 zlibbioc_1.46.0
[23] purrr_1.0.2 R.utils_2.12.3
[25] RCurl_1.98-1.16 rappdirs_0.3.3
[27] GenomeInfoDbData_1.2.10 ggrepel_0.9.4
[29] irlba_2.3.5.1 spatstat.utils_3.1-1
[31] terra_1.7-55 units_0.8-4
[33] goftest_1.2-3 RSpectra_0.16-1
[35] dqrng_0.4.1 DelayedMatrixStats_1.22.6
[37] codetools_0.2-19 DropletUtils_1.20.0
[39] DelayedArray_0.26.7 tidyselect_1.2.1
[41] raster_3.6-26 farver_2.1.2
[43] viridis_0.6.4 ScaledMatrix_1.8.1
[45] base64enc_0.1-3 jsonlite_1.8.9
[47] BiocNeighbors_1.18.0 e1071_1.7-13
[49] ellipsis_0.3.2 tools_4.3.1
[51] ggnewscale_0.4.9 Rcpp_1.0.13-1
[53] glue_1.8.0 gridExtra_2.3
[55] xfun_0.40 mgcv_1.9-1
[57] HDF5Array_1.28.1 withr_3.0.2
[59] BiocManager_1.30.22 fastmap_1.2.0
[61] boot_1.3-28.1 rhdf5filters_1.12.1
[63] fansi_1.0.6 rsvd_1.0.5
[65] R6_2.5.1 mime_0.12
[67] colorspace_2.1-1 wk_0.8.0
[69] tensor_1.5 dichromat_2.0-0.1
[71] RSQLite_2.3.8 R.methodsS3_1.8.2
[73] utf8_1.2.4 generics_0.1.3
[75] renv_1.0.3 class_7.3-22
[77] httr_1.4.7 htmlwidgets_1.6.2
[79] S4Arrays_1.0.6 tmaptools_3.1-1
[81] pkgconfig_2.0.3 scico_1.5.0
[83] gtable_0.3.6 blob_1.2.4
[85] XVector_0.40.0 htmltools_0.5.6.1
[87] scales_1.3.0 png_0.1-8
[89] knitr_1.44 rstudioapi_0.15.0
[91] rjson_0.2.23 curl_6.0.1
[93] proxy_0.4-27 cachem_1.1.0
[95] rhdf5_2.44.0 BiocVersion_3.17.1
[97] KernSmooth_2.23-21 vipor_0.4.5
[99] parallel_4.3.1 AnnotationDbi_1.62.2
[101] leafsync_0.1.0 s2_1.1.4
[103] pillar_1.9.0 grid_4.3.1
[105] vctrs_0.6.5 promises_1.2.1
[107] BiocSingular_1.16.0 beachmat_2.16.0
[109] xtable_1.8-4 cluster_2.1.4
[111] beeswarm_0.4.0 evaluate_0.22
[113] magick_2.8.5 cli_3.6.3
[115] locfit_1.5-9.10 compiler_4.3.1
[117] crayon_1.5.3 labeling_0.4.3
[119] classInt_0.4-10 ggbeeswarm_0.7.2
[121] plyr_1.8.9 stringi_1.8.4
[123] stars_0.6-4 viridisLite_0.4.2
[125] deldir_2.0-4 BiocParallel_1.34.2
[127] munsell_0.5.1 Biostrings_2.68.1
[129] leaflet_2.2.0 Matrix_1.5-4.1
[131] leafem_0.2.3 sparseMatrixStats_1.12.2
[133] bit64_4.5.2 Rhdf5lib_1.22.1
[135] statmod_1.5.0 KEGGREST_1.40.1
[137] shiny_1.7.5.1 interactiveDisplayBase_1.38.0
[139] igraph_1.5.1 memoise_2.0.1
[141] lwgeom_0.2-13 bit_4.5.0
©2024 The pasta authors. Content is published under Creative Commons CC-BY-4.0 License for the text and GPL-3 License for any code.